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Algebraic Bayesian analysis of contingency tables 
with possibly zero-probability cells 
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Abstract: In this paper we consider a Bayesian analysis of contingency tables al- 
lowing for the possibility that cells may have probability zero. In this sense we 
depart from standard log-linear modeling that implicitly assumes a positivity con- 
straint. Our approach leads us to consider mixture models for contingency tables, 
where the components of the mixture, which we call model-instances, have distinct 
support. We rely on ideas from polynomial algebra in order to identify the various 
model instances. We also provide a method to assign prior probabilities to each 
instance of the model, as well as describing methods for constructing priors on the 
parameter space of each instance. We illustrate our methodology through a 5 x 2 
table involving two structural zeros, as well as a zero count. The results we ob- 
tain show that our analysis may lead to conclusions that are substantively different 
from those that would obtain in a standard framework, wherein the possibility of 
zero-probability cells is not explicitly accounted for. 

Key words and phrases: Algebraic statistics; Bayes factor; Compatible priors; Expo- 
nential family; Log-linear model; Model-instance; Positivity constraint; Structural 
zero; Toric model. 

1. Introduction 

The analysis of contingency tables has a well established tradition, both in 
the frequentist and Bayesian setting. A typical framework for this analysis is rep- 
resented by the exponential family representation of the sampling distribution, 
together with the log-linear, or more generally log-affine, model for the expected 
cell count, see Lauritzen (1996, ch. 4) for a rigorous treatment. Under multi- 
nomial sampling, this approach presupposes implicitly that cell-probabilities, 
equivalently cell-expected counts, are strictly positive. On the other hand, this 
assumption is not particularly justified from a substantive viewpoint; indeed, as 
we shall argue below, it might well hide some interesting aspects of modeling. 

Typically, the positivity constraint is viewed as problematic when performing 
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Maximum Likelihood Estimation (MLE) in a log-linear framework if there are 
some cells having zero counts, see for instance the discussion in Christensen 
(1997, ch. 8). One usually distinguishes between structural (or "fixed") zeros, 
and random (or "sampling") zeros. The former arise when the cells are logically 
forced to have a zero-count. Consider for instance a cross-classification for people 
where the personal highest educational attainment (Less than high school, High 
school, College, Postgraduate) is recorded at a given time, and five years later. 
Clearly it is impossible for someone to have a highest attainment of College on 
the first time point, and Less than high school or High school five years later; 
in general every cell that corresponds to lower attainment at the second time 
period compared to the first time period is a structural zero. On the other hand, 
random zeros are typically thought to occur either because the sample size, or 
the corresponding cell probability, or both are "small", as it occurs in sparse 
contingency tables. 

Structural zeros are typically dealt with by removing them altogether from 
the analysis. One way to do this is through regression models on effect codings, 
see e.g. Simonoff (2003, sect. 6.4). Random zeros on the other hand require 
special handling. Essentially one should first identify those cells for which the 
regular MLE of the cell-probability does not exist, i.e. is zero (this requires 
special care as such cells need not coincide with those having zero counts), and 
then remove them from the analysis. In any case the computation of the degrees 
of freedom for model testing must be done on a case by case basis, and requires 
some ingenuity. Another difficulty generated by the presence of random zeros 
is that asymptotic arguments may effectively break down because of the small- 
sample size, although some computer programs may still provide MLEs when 
they actually do not exist. For an informative account of the above problems 
see Haberman (1974), Bishop et al. (1975, sect. 5) and Christensen (1997, sect. 
8.3). Recently Eriksson et al. (2006) have provided a polyhedral description of 
the conditions for the existence of the MLE for a hierarchical log-linear model 
together with an algorithm for determining if the MLE exists. 

In this paper we take the view that modeling of contingency tables should 
allow explicitly for the possibility of zero-probability cells not only to deal with 
structural zeros but also with zero-counts whose nature is undecided, in the 
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sense that their occurrence may be consistent with either a zero probability or a 
positive probability: we call these cells possibly zero-probability cells. 

An early paper that takes a similar view is Lauritzen (1975), although the 
techniques used there are quite different from the ones that we employ here. 

From a modeling perspective, we contend that, for each given model, the 
usual exponential-family/log- linear representation of the sampling distribution is 
simply one instance of such model, while several other instances are conceptually 
consistent with the assumed model, each being essentially a log-linear model 
with a restricted support. The identification of such instances represent a crucial 
aspect in the implementation process, and is typically of high complexity. 

In our work we rely on ideas from polynomial algebra and the related ge- 
ometric and combinatorial structure, which have been recently applied to the 
analysis of some classes of (finitely) discrete statistical models. In particular, 
Eriksson et al. (2006) deal with hierarchical log-linear models, while Geiger et 
al. (2006) discuss graphical models. 

Our approach falls broadly under the heading of Algebraic Statistics, see 
Pistone et al. (2001) for an early general account, as well as the pioneering work 
of Diaconis and Sturmfels (1998). The field is now growing at an impressive speed 
both in terms of theoretical contributions and applications, see for example the 
recent monograph by Pachter and Sturmfels (2005). Further useful references are 
Geiger et al. (2001), who develop the concept of stratified exponential families, 
as well as Garcia, Stillman and Sturmfels (2005) who carry out the analysis 
of Bayesian networks from an algebraic statistical perspective. Rapallo (2006) 
discusses some basic algebraic statistics tools that deal explicitly with models 
for contingency tables and represents a simple and useful introduction to this 
paper. Our interest in the use of algebraic methodology for statistical purposes 
was stimulated by the availability of various symbolic computational software: 
here we use CoCoA developed and maintained at the University of Genova, Italy. 
An other option could be the softare 4ti2. 

A specific feature of this paper is the combination of methods from algebraic 
statistics with the Bayesian approach. Specifically, we shall deal with issues like 
the assignment of a prior on model space, prior elicitation on the parameter 
space under each model, or instances thereof; together with model choice using 
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the Bayes factor, see Kass and Raftery (1995) for a review. 

The paper is organized as follows: Section 2 contains some basic tools from 
algebraic statistics that are used in the paper; in Section 3 such tools are applied 
to a real data-set; Section 4 is the core of the paper, presenting a Bayesian ap- 
proach testing quasi-independence in two-way contingency tables using a mixture 
of model- instances, thus accounting for the possible presence of zero-probability 
cells. Finally, Section 5 summarizes the paper and presents some points for 
discussion. 

2. Algebraic statistical models 

Consider a finite state space Q and a probability distribution on Q, which we 
can write as {p(x), x G Q}, with p(x) > and Y1x&qP( x ) = 1. In particular, we 
shall deal with multi-way contingency tables identified by a collection of factors 
X = {X\, . . . , Xp}. If If denotes the set of levels for the factor Xf , f = 1, . . . , F, 
the state space is a product space, i.e Q = xj =1 lf. 

A log-linear model assumes that p(x) > and that logp(x) belongs to a 
linear subspace H of L = M®, where denotes as usual the vector space of 
real- valued functions on Q. If H is spanned by {T\, . . . ,T S }, where the Tj's are 
integer valued functions, we can write the log-linear model as 

s 

logp(x) =J2Qog( j )T j (x), (2.1) 
j'=i 

with ^2 x p{x) = 1. Recall that (|2.ip assumes strict positivity of p(x). However 
the latter is no longer needed if we rewrite (12. ip as 

q(x) = (™...g°W, Ci>0, j = 0,...,s, (2.2) 

where q(x) is the un-normalized probability, so that the parameters Ci) • • • > Cs are 
only subject to non-negativity constraints. Notice that (12.2() is, for each x G Q, 
a (monic) monomial in the indeterminates Ci? • • ■ > Cs- When x scans Q, we get a 
system of binomial equations and so (|2.2p could also be called a parametric toric 
model, borrowing terminology from commutative algebra, see Sturmfels (1996), 
as suggested in Pistone et al. (2001). 

When the cell probabilities are assumed to be strictly positive, then the log- 
linear model (|2.ip and the toric model (|2.2p can be easily shown to be equivalent. 
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A third expression of the same model can be derived by elimination of the in- 
determinates £i , . . . , Cs in the monomial parameterization of equation (12. 2p . In 
fact, if M = T\{x) ■ ■ ■ T s (x) is the design matrix of the log-linear model 

L J xgQ 

of equation (|2.ip . the orthogonal space of its range can be generated by integer 

Ah * * * kr 



, and equation (|2.2p gives for 



valued vectors with zero sum K 
each j = 1, . . . ,r 



\[ q {xf^ = ii (Ci {x) ■ ■ ■ cl s{x) ) = ci 1(xykm ■ ■ ■ ci^ xykm = i, (2.3) 

X X 

where the dot symbol "-"denotes scalar product. 

As the sum of the elements of each kj, j = 1, . . . , r, is zero, the sum of the 
elements of both the positive part and the negative part kj are equal, so that 
we could write equation f|2.3[) as 



II ^) k]{x)+ - II ^) k](xr =0, 3 = 1, • • • , r. (2.4) 



It follows that the toric model (|2.2j) implies a set of r binomial and homoge- 
neous equations in the un-normalized probabilities q(x), x G Q. 

If the probabilities are assumed to be strictly positive, then the three descrip- 
tions, i.e. log-linear (|2.ip . toric (12. 2p and implicit binomial (12. 4p . are equivalent. 
We remark that while (12. ip and (|2.2p are parametric models, the nature of (12. 4p 
is essentially non-parametric. When the positivity assumption is relaxed, a non 
trivial situation occurs. The basic fact is that different toric parameterizations 
can lead to the same implicit binomial, because they are equivalent only on the 
strictly positive part of the model. However, the implicit binomial equations 
are satisfied by all limits of the positive cases; thus the implicit binomial is the 
best expression of the so called extended exponential model, i.e. the exponential 
model plus all its limits. 

We summarize here a few basic facts of the theory of toric statistical models. 
Given a log-linear model and all its limit points, a specific set of configurations 
of zero-probability cells arises. This set cannot be recovered by setting to zero 
some parameters in a generic toric parametric representation, because most of 
the equivalent toric representations will not produce all possible probabilities of 
the model in Equation (|2.4p . However, there exists a "maximal" parametric toric 
representation, such that all configurations of zero-probability cells compatible 
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with, i.e. limit of, the initial model are obtained by letting some parameters be 
zero. Such representation results from the following steps: 

1. All toric models compatible with the implicit binomial model (|2.4|) are char- 
acterized by a string of T's exponents, see (|2.2p . which is a non-negative 
integer vector orthogonal to the basis \k\ . . . k r ] of the orthogonal space of 
the initial design matrix M. 

2. The lattice of non- negative integer vectors t G such that the condition 
t ■ kj = holds for each j = 1, . . . , r, has a finite number of generators that 
can be computed with symbolic software. Here "generator" means that 
all such vectors are component- wise sums of a finite number of generators, 
possibly repeated. The minimal set of generators is called minimal Hilbert 
basis. 

3. If the generators are Si, . . . , S u , then the "maximal" toric model is 

q(x) = (i l{x) ---& {x) xeQ. (2.5) 

Here "maximal" means that (|2.5p is a (possibly non-identifiable) parameteriza- 
tion of the full implicit binomial model, i.e. the extended model. All members 
of the implicit model (|2.4p with zero-cell probabilities are obtained by letting 
some Cj-'s be zero. Assume e.g. we let £i = 0. Then the support of the resulting 
probability will be the set Qi = {x £ Q : Si(x) = 0}. On such a restricted 
support, the model will be again toric: 

?(*) = (2 2(X) ■ ■ ■ & {X) *€&. 

or exponential if all the other parameters fa, • • ■ , Cu are assumed to be strictly 
positive. In this sense, we say that each toric model is a union of exponential 
models with different supports. Each one of these models is called an instance 
of the model. 

Current symbolic software allows to compute, for a given parametric model, 
the set of corresponding implicit binomial descriptions. Moreover, the collection 
of allowable models obtained by setting some cell probabilities equal to zero can 
be identified in terms of the functions Tj(x), see Geiger et al. (2006) and Rapallo 
(2006). 
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3. Example: new cancer incidence and gender 

We now turn to the discussion of a real example involving both structural 
and random zeros. Our analysis aims primarily at illustrating the main features 
of our method. 

The Division of Cancer Prevention and Control of the National Cancer In- 
stitute in the United States provides (estimates of) counts of new cases of cancer 
classified according to various demographic and geographic factors, see Simonoff 
(2003, p. 226). The following table reports data for different types of cancer 
separated by gender for Alaska in year 1989. 



Type of cancer 


Female 


Male 


Total 


Lung 


38 


90 


128 


Melanoma 


15 


15 


30 


Ovarian 


18 


* 


18 


Prostate 


* 


111 


111 


Stomach 





5 


5 


Total 


71 


221 


292 



Clearly cells (3, 2) and (4, 1) are structural zeros, while we regard the zero 
count corresponding to the combination (Stomach, Female) as a possibly zero- 
probability cell. A typical assumption that is of interest in this case is that of 
quasi-independence (QI), corresponding to the standard independence assump- 
tion for all cells, excluding those having a structural zero. For this hypothesis, 
Simonoff (2003, p. 228) finds a p-value between 2% and 3%, depending on the 
method that is employed. Using a conventional frequentist interpretation, the 
data thus seem to provide significant evidence against the Q/-model, although 
this evidence is not very strong. 

Let I = {1,2,3,4,5}, J = {1,2} denote the set of levels for the rows and 
columns respectively, and consider the two-way table with cells in the set A = 
I x J \ {(3, 2), (4, 1)}, i.e. with cells (3, 2) and (4, 1) missing. 

Under the QI-model the un-normalized cell probabilities qij are given by 

Qij = Piipj, G A. (3.1) 

If the probabilities are strictly positive, one can take the logarithm of (|3.ip . 
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obtaining 

loggjj = oti + Pj, e A 

with a, = log pi, (3j = log tpj. Accordingly the design matrix M, together with a 
suitable choice of an orthogonal matrix K, as described in Step 1 of Section 2, 
are 



M = 





a i 


Q!2 


«3 


a i 


015 


01 


ft 




fci 


k 2 


11 


' 1 














1 


" 


11 


1 





21 





1 











1 





21 


-1 


-1 


31 








1 








1 





31 








51 














1 


1 





51 





1 


















K = 






12 


1 

















1 


12 


-1 





22 





1 














1 


22 


1 


1 


42 











1 








1 


42 








52 














1 





1 


52 





-1 



One can check that, under the condition > 0, £ A, the model of 

quasi-independence in (|3.1|) is equivalent to the implicit binomial model given by 
the two constraints 

911922 - 921912 = 
951922 - 921952 = 0. 

The above equations are the standard conditions for independence in the two 
2x2 tables with rows {1,2}, respectively {2,5}. This is equivalent to the inde- 
pendence of the sub-table {1, 2, 5} x {1, 2}, since independence for an R x C-table 
is equivalent to the its 2 x 2 minors being zero. 

The maximal design matrix M max and the model in monomial form, see 
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(1231) . are 





Ci 


C-2 


Ca 


Ci 


c 5 


Ce 


(l 




= CsCt 






















11 


" 











1 





1 " 








21 








1 











1 




921 


= CsCt 


31 


1 






















931 


= Ci 


51 











1 








1 




951 


= U(l 


12 














1 


1 





< 


912 


= CsCe 


22 








1 








1 







922 


= CsCe 


42 





1 





















= c 2 




















942 


52 


_ 








1 





1 























> 952 


= C 4 Ce 



(3.3) 



Notice that the cells associated to a structural zero in the same row are param- 
eterized independently from the rest of the table. If we take out these cells, we 
simply get the full independence model on the sub-table with rows {1, 2, 5}. 

The instances for the Ql-model are computed by considering the (2 3 — 1)(2 2 — 
1) = 21 instances corresponding to independence in the 3x2 sub-table, times 
the 2 2 = 4 instances of the two free cells, plus the (2 2 — 1) instances where the 
3x2 sub-table is zero. The total is 87. 

4. Testing quasi-independence in the new cancer data 

We provide a Bayesian analysis of these data using the methodology devel- 
oped in the previous sections. We refer to the model which imposes no restriction 
on the cell probabilities, save the zero-probability cells (3,2) and (4,1), as the 
Structural Zero model and label it with the symbol SZ. Since the table has 10 
probability cells, of which 2 are fixed to be zero, the number of S'Z-instances is 
equal to 2 8 — 1 = 255 corresponding to all possible combinations of "+" and "0" 
in the 8 free cells, excluding the trivially impossible case of all "0" . 

Moreover, only two of the above SZ-instances are logically consistent with 
the observed data: that giving a positive probability to all eight free cells; and 
that giving zero-probability to cell (5, 1) only. We label these instances SZq and 
SZ\, where the subscript refers to the number of zero-probability cells, corre- 
sponding to the tables: 
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S Zq S Z\ 



Type of cancer 


Female 


Male 


Female 


Male 


Lung 


+ 


+ 


+ 


+ 


Melanoma 


+ 


+ 


+ 


+ 


Ovarian 


+ 





+ 





Prostate 





+ 





+ 


Stomach 


+ 


+ 





+ 



Similarly, for the given data, it is not difficult to realize that there exists 
only one logically consistent instance of the quasi-independence model, i.e. that 
having all positive cell-probabilities (except for the two cells corresponding to 
structural zeros), which we label QIq and is schematically equivalent to SZq 
above. 

4.1. Conventional approach We test the model of quasi-independence 
against the structural-zero model using a Bayesian approach. In a "conventional 
setting", wherein no particular provision for zero-probability cells is envisaged, 
we would simply consider one instance for each of the above two models, namely 
SZq and QIq. 

Given the cell counts n = (n^), a typical analysis would involve the com- 
putation of the Bayes factor, see Kass and Raftery (1995), of QIq versus SZq, 
i.e. 

BF(QI • SZq) = / fQh{ n \ e Ql Q )^Ql { e Ql Q ) de Qh = m QIo (n) 
J fszo{n\0szo)^sz o {dszo)d6sz o ™sz ( n )' 



where 



fsz is the multinomial sampling distribution under SZq, with cell-probabilities 
&sz = {hi) G A, and similarly for /qj under the quasi-independence 
model, whose cell-probabilities are denoted by 9qi ; 

7rsz Q and ttqi q are the prior densities for 9sz , respectively 0qi o ; 

m-SZo denote the marginal distribution of n under SZq, and similarly for 

■mQi . 



To obtain the posterior probability of model QIq one should provide, in addition, 
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its prior probability pqi = Pr(QIo), leading to 

p , nm PQIo BF(QI :SZ ) 

Pr(Q7 n) = 77757777 07 \ i ' 42 ) 

where psz = Pt(SZ ) = 1 - p Q / . 

A Bayesian analysis of this problem would take the prior ttsz to be Dirichlet, 

i.e. 

9sz ~ Di(a), (4.3) 

with a = (cxij) and ct^ > 0, see e.g. Bernardo and Smith (1994, p. 134-5 and 
441) and O'Hagan and Forster (2004, chapter 12). As a consequence, msz (n) 
is a Multinomial-Dirichlet with distribution 



. Nl H A {a 
msz {n) = =; : x 



U(i,j)eA n ir H A {a*Y 

n = (n ij ), nij = 0,1,..., AT, ^ nij = N (4.4) 

where 

llteT r (yt) 

and a* = a + n. 

Consider now the quasi-independence model QIo, and in particular the choice 
of the prior ttqi . This presents some conceptual and practical challenges, that 
we now try to elucidate. Although, in principle, priors under distinct models need 
not be related, as they express prior beliefs conditionally on different states of 
information, it is nevertheless desirable that they should be related at least when 
models are nested within an encompassing model. Pragmatically, this would sim- 
plify the elicitation task, since one would only assign a prior on the parameter 
under the latter model, and then derive the corresponding priors under each of 
the remaining models from this single prior. This procedure should also achieve 
some sort of internal "compatibility" among prior specifications. A general dis- 
cussion of strategies for building compatible priors under several related models 
is contained in Dawid and Lauritzen (2001). Further discussion, elaboration and 
references may be found in Consonni et al. (2005), and in Consonni and Veronese 
(2006). 
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Before turning to model QIo, it is expedient to rewrite the joint distribution 
of the counts riij, (i, j) 6 A, for the SZo-model as 

fsz {n\9) = fsz o ,i{n(i)\0) x fsz ,2(n^)\n {1) ,9), (4.5) 

where 

n(t) = (n 31 ,n i2 ,N - n 31 - n 42 ) 

n {2 ) = K:(*,i)€A\{(3,l),(4,2)}) 

Since, for (i, j) G A, the joint distribution of n = (n,ij), under SZq, is multinomial 
with size N and vector of probabilities 9 = (Oij), written Mu(iV; 9), it is easy to 
check that fsz ,i(n(i)\9) is a Mu(iV;A) with 

Al = #31> A2 = #42, A3 = 1 — Ai — A2, 

while fsz , 2(^(2) I "(l))^) is given by Mu(iV - n 31 - n 42 ;7), where 

7« = -I ft S = v ' G A \ ^ ^ ^ 

L-V31- 2^(ij)eA\{((3,l),(4,2))} 

The parameters A and 7 are variation independent, i.e. their joint range is the 
product of the two individual ranges. 
Under QIq we must have 

7ij = 7j+7+i , {hj) e A \ {(3,1), (4, 2)}, 

where 

7i+=7il+7i2, * = 1,2, 5 
7+j = 7ij + 72j + 75 j, i = 1, 2. 

Let 7^ denote the collection of 7 i + , and 7 c that of 7+ j . Then the distribution 
of the counts n under QIq can be written as 

fQl (n\X,jR,lc) = fQl ,i(n( 1 )\X)fQi ,2( n (2)\n(iy,lR,1c), (4.6) 

where fQi ,i(nm |A) is Mu(iV; Ai, A2, A3) and so coincides with the expression of 
/sz o ,l( n (l)|0) in (S3])! while /Q/ ,2(w(2)|n(i),7fl 5 7c) is given by 

r t 1 \ (-A^ ^-42)' 71.14. n2+ ns+ ft+i ri+2 / /i n\ 

/<3/o,2(»»(2)|n(i);7B,7c) = n X 7i| + 7 2 + 7 5 + x 7+1 7+2 ( 4 - 7 ) 

ll(i,i)eA\{(3,l),(4,2)r lii !} 
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where h + j = n\j + n 2 j + n^j . 

One can thus see that under QIq the joint distribution factors into three 
terms, one involving A, one involving 7^ and one involving 7c. 

Consider now the prior distribution. Given that Osz ~ ^K a sz ) we first 
remark that A and 7, are independent, because of ii) of Lemma[H see Appendix; 
as a consequence we also get that A is independent of the pair (tr,7c)- Fur- 
thermore 7 ~ Di(a.ij, £ A \ {(3, 1), (4, 2)}), so that 7^ ~ Di(a^) and 
7c* ~ Di(ac), where ocr and ac are defined in accordance with 7^ and 7c, re- 
spectively. Assuming independence of 7^ and 7c* makes the computation of the 
marginal distribution mQ; (n) straightforward since we can integrate separately 
the three terms in (|4.6h , see also (14. 7ft , each integral being, up to the multinomial 
coefficient, of type Multinomial-Dirichlet. 

Specifically we get 

AH 



m, 



QI ( n ) 



F(a 3 i,a 4 2,a + - 0:31 - a 42 ) 
H(al 1 ,a\ 2 ,a* + - a| x - a% 2 ) 
x H(a 1+ ,a 2+ ,a 5+ ) x H(a +1 ,a +2 ) 
H{a* l+ ,a* 2+ ,al + ) H(a* +1 ,a* +2 )' 

where a + = J2(ij)eA = a ij + a 2j + a 5j, "+-,• = ai J + n i? + " 2 i + n2 i + 

a 5j + n 5j . 

4.2. Allowing for zero-probability cells Phylosophically, we earnestly 
take the view that each instance of a model must be assigned a-priori a positive 
probability: in this sense we completely adhere to the principle that Lindley 
(1985, p. 104) names "Cromwell's rule". This leads us naturally to the idea of 
regarding a model M as a finite mixture of its instances. This aspect represents 
a characterizing feature of our approach to the analysis of contingency tables. 

We can thus write the mixture representation of M. as 

fM(n\9 M ) = ^2qM h fM h {n\9M h ), (4.9) 
h 

where 9m is the collection of all instance-specific parameters 9m h an d QM h is the 
prior probability attached to instance Mh- 

Specializing (|4.9p to the SZ and QI model, and then computing the marginal 
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Table 4.1: Prior probabilities qsz , qsz ± , lQi a f° r selected values of £ 





Qsz 


lsz x 


QQi 


0.1 


0.43 


0.05 


0.78 


0.2 


0.17 


0.04 


0.51 


0.3 


0.06 


0.03 


0.23 


0.4 


0.02 


0.01 


0.07 


0.5 


0.00 


0.00 


0.01 



distribution of the data under each model, leads to the Bayes factor 



BF(QI : SZ) 



q Q l m QIo {n) 



(4.10) 



qsz m SZo (n) + qsz 1 m S z 1 (n)' 

Let us now consider in detail the computations that are needed for the eval- 
uation of BF(QI : SZ). Let £ G (0, 1) be the chance that a cell has zero proba- 
bility, and assume that the allocation of zero probability to each cell takes place 
independently. Then, we can derive qsz and qszi , and obtain 

fi-0 8 



Qsz 



qsz 1 



i-e 8 ' 



(4.11) 
(4.12) 



Consider now the assignment of qQi - We recall that we have 87 instances with 
total probability C(£), then 



QQI 



(4.13) 



Table 14.11 reports the value of qsz , qsz 1 , qQi for selected choices of £ (for 
values of £ above 0.5, the values are zero to two decimal places). 

We now consider the marginal distribution of the data under the SZ\- 
instance. The conditioning method of Lemma [lj item ii), leads immediately 
to conclude that 8 S z 1 ~ D^as^J, where a SZl = ("y, G ^4 \ {(5,1)}), 

whence mg^i has an expression analogous to that of msz , the only difference 
being that now the set over which the indexes vary is A \ {(5, 1)}. 

For given £ and a, the Bayes factor BF(QI : SZ) can now be computed 
using (14.101) . Notice that the multiplicative term — — r appears both in the 

1 L(ij)£A ni 3- 
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numerator and denominator of (|4,10p . and so cancels out (strictly speaking the 
product for the instance SZ\ is over a set that does not contain (5, 1): however 
since nsi = the result is the same whether this value appears or not). 

Consider first the assignment of £;, which represents the chance that a cell 
has probability zero. Save for the case of a structural zero, it seems reasonable 
that we should assign a low value to £, since the corresponding event should be 
regarded a priori as a rather unusual circumstance. In view of Table I4~T1 setting 
£ = 0.1 seems a sensible choice. Indeed, while the prior probability of model QI 
is higher than that of SZ, nevertheless the discrepancy between the two values 
(0.78 against 0.48) is less pronounced for this choice of £ than for other choices, 
so that the comparison between the two models is fairer. 

We now take into consideration the choice of a. Unless there exists substan- 
tive prior information allowing to discriminate a priori between cells, we shall 
choose the same value a for each a™; also low values of a are typically recom- 
mended, whenever prior information is weak. Natural choices are represented by 
a = 0.5, corresponding to Jeffreys prior, or a = 1, corresponding to a uniform 
prior on the simplex. 

We now provide a method for the choice of a, using the technique of the 
imaginary training sample. This method has been implemented for instance by 
Spiegelhalter and Smith (1980) to deal with model choice using improper priors. 
We believe however that the idea can be usefully applied also in the context of 
proper priors, see Consonni et al. (2005) for a similar elaboration. 

Consider for simplicity only the models SZq and QIq. Suppose we can iden- 
tify a minimal imaginary training sample that provides maximal support (irre- 
spective of the prior) to model QIq. Then it is reasonable to require that the 
Bayes factor for these fictitious data should be approximately 1, i.e. the models 
are "equally likely" in terms of the empirical evidence. To see why this should 
be the case, notice that, on the one hand the data actually support QIq very 
strongly; on the other hand, the sample size is so small that the evidence in 
favor of either model should be roughly the same. The condition that the Bayes 
factor should be equal to 1 can be employed to select reasonable values for the 
hyper-parameters of the prior distribution. 

Consider the situation in which we have 1 observation in each cell, for a 
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total of 8 observations. It is straightforward to verify that this table is perfectly 
consistent with the Qlo-model: in particular the actual and fitted counts (the 
latter based on ML estimates) coincide. If we fix £ = 0.1 as suggested above, the 
value a = 1 provides a Bayes factor equal to 1.03, which is quite satisfactory; on 
the other hand a = 0.5 would give a BF equal to 0.67. We also experimented 
with other values of a and did not get values of BF close to 1. 

Having set £ = 0.1 and a = 1, we now proceed to the analysis of the cancer 
data. The Bayes factor of QI against SZ is equal to 0.17, which is clearly not 
supporting the hypothesis of quasi-independence. To better assess this value, it 
is useful to derive the Bayes factor against QI, which is merely the reciprocal of 
the above, and to further transform it using the logarithm in base 10. In this 
way we can make use of the scale developed by Jeffreys, see Kass and Raftery 
(1995) and Robert (2001, p. 228), for the interpretation of the evidence provided 
by a Bayes factor. Specifically, the evidence against QI is 

• poor if < log 10 BF(SZ : QI) < 0.5, 

• substantial if 0.5 < log 10 BF(SZ : QI) < 1, 

• strong if 1 < log 10 BF(SZ : QI) < 2, 

• decisive if log 10 BF(SZ : QI) > 2, 

where BF(SZ : QI) = 1/BF(QJ : SZ). As a consequence we get log 10 (l/0.17) = 
0.77 which thus represents substantial evidence against QI, essentially in accord 
with the frequentist answer which states a p- value between 2% and 3%. It is 
instructive to verify what would have been the result of a conventional Bayesian 
analysis, based exclusively on the positive-cell models SZq and QIq, as opposed 
to the model based on mixtures developed in this paper. Recall that, in the 
standard case, the BF would simply be the ratio m,Qi {n) / msz {n) . In this case 
the BF takes the value 0.55, which is appreciably higher than the value 0.17 
obtained with our analysis. More interestingly, when translated to the Jeffreys 
scale, we obtain log 10 (1/0.55) = 0.26 which only represents poor evidence against 
QI, which is an order of magnitude lower, on the Jeffreys scale, than the one we 
obtained with our analysis. 
4. Discussion 
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In this paper we have presented a new methodology for the Bayesian analysis 
of contingency tables that allows explicitly for the possibility of zero-probability 
cells. 

The essential features of our approach are: the notion of extended log-linear 
model, the support of computational algebraic geometry to enumerate and list all 
model-instances having varying support, the use of a mixture model to represent 
the sampling distribution of the cell-counts, a technique to assign prior probabili- 
ties to the various model-instances, a method to derive prior distributions on the 
parameter space of each model-instance starting from a Dirichlet prior under the 
structural zero model, as well as an elicitation procedure for the corresponding 
hyper-parameters . 

We have illustrated our methodology by means of an application to a real 
data set involving a cross-classification of types of cancer and gender. The corre- 
sponding contingency table presents two structural zeros, and a cell with a zero 
count. The results we obtain, when testing the hypothesis of quasi independence, 
show that our methods can lead to conclusions that are substantively different 
from those based on a standard modeling analysis, which does not explicitly allow 
for the possibility of zero-probability cells. 

In order to apply the algebraic Bayesian approach presented in this paper 
to large and sparse contingency tables, we believe that a purely "automated" 
approach can be expected to run into serious computational issues, although 
technology is rapidly evolving in this area as for instance evidenced, within the 
field of Maximum Likelihood Estimation, in the recent paper by Erikkson et al. 
(2006), see also Patcher and Sturmfels (2005) for a variety of high-dimensional 
applications. A careful choice of prior distribution is often the only sensible way 
to make the analysis viable, see for instance Diaconis and Rolles (2006) in the 
context of Markov chains with forced zeros. We therefore believe that a blend of 
computational algebraic methods and prior information on the set of possibly- 
zero probability cells is likely to be the best option for the analysis of moderate 
to large multi-way tables. 
Appendix 

We summarize below some useful facts about the Dirichlet distribution, see 
e.g. Bernardo and Smith (1994, pp. 134-5) (notice however that our notation is 
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slightly different from theirs). 

Lemma 1 Let 9 = (9±, . . . , 9 S ), with < 0& < 1, k = 1, . . . , s, and J2k=i ®k = 1- 
Assume that 6 ~ Di(a), mi/t a = (ai, . . . , a s ) and > 0. 

i) 

I 6>i, . . . , 9 r , (1 - ^ ^) j ~ Di I ai, . . . , a r , ^ aj J , r < s. 

V Z=r+1 / V l=r+l J 

ii) Let 6' m = jj m e , m = 1, . . . , r, r < s, then 

(e[,...,9' r )~Di( ai ,...,a r ), 
and (6'i, . . . , 0' r ) is independent of (9 r +i, ■ ■ ■ , 9 S ). 
Hi) Let 9\ = 6»i + . . . + 9 h , 9* t = 9 it _ x + . . . + 9 S , l<t<s, then 

(9* 1 ,...,9* t )^Bi(a* 1 ,...,a* t ), 
a* = a>\ + . . . + oti l} . . ., ol% = a it l + . . . + a s . 
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